Dealing with Life Course Data in Demography: Statistical and Data Mining Approaches
Authors: G. Ritschard and M. Oris
Abstract
This paper has an essentially methodological purpose. In the first section, we briefly present demography and historical demography, the close ties between these two disciplines and their common intellectual history, the crisis they experienced in the 1980s, and how the life course paradigm and methods have been implemented to meet the challenge of shifting “from structure to process, from macro to micro, from analysis to synthesis, from certainty to uncertainty” (Willekens, 1999, pages 26–29). This retrospective look also shows impressive progress in promoting real interdisciplinarity in population studies, family demography being probably the best example. However, we also note that the success of multivariate causal analyses has been so rapid that some pitfalls are not always avoided. In Section 2, we focus on the study of transitions. We first recall the main regression models, then discuss and illustrate the problem of population heterogeneity, how it can affect the interpretation of results, and the value of robust estimates and of the notion of shared frailty for dealing with it. We also present Markovian models, a method less popular than event history models but well suited to studying states observed at periodic times. In Section 3, we address the gap that we observe between standard demographic analysis and causal research, i.e. the lack of knowledge on trajectories. The developing field of data mining provides useful tools to fill this gap, and we would like to promote their use.

1 Life course approach in demography and historical demography

Probably no discipline is more dependent on a single model than demography. This model, if not a law, is that of the demographic transition, elaborated by Landry and Notestein in the 1930s and 1940s. It offered a comprehensive reading of the past, present and future of the world population, including the disparity between developed and developing countries.
Such a framework transcended both geographical and chronological borders. Two examples are especially famous. Princeton demographers studied the fertility decline in nineteenth-century Europe to find solutions that could be applied to the so-called “third world” (Coale and Watkins, 1986). Louis Henry, a French demographer, was at the origin of the tremendous development of historical demography in the 1960s and 1970s, when he invented the family reconstitution method to observe a “natural” pre-transitional fertility among the eighteenth-century European rural population (Henry, 1956). For fifty years it has been quite common for demographers to work on the past and for historical demographers to work on the present, if not the future. In this paper, we stay close to this tradition, taking our illustrations indifferently from demography or historical demography.

These two disciplines share not only a founding model but also, to a large extent, a similar intellectual history. Telling this history is of course not our purpose, but a brief summary is important to see when and why the life course paradigm and longitudinal methods emerged. Dealing with structures and flows, demography has been a science of reconstruction and description of patterns and behaviors, relying on a well-established quantitative methodology and on the conviction that the higher the number of observations, the more accurate and possibly useful the results (Pressat’s manuals have remained classics for generations). Demography was a science of the masses, growing or stagnating, young or old, not of individuals. At the same time, the engagement of generations of scholars was largely motivated by the centrality of population issues and by the location of demography at a crossroads between economics, sociology, epidemiology, territorial analysis, political science and, more recently, cultural and gender approaches.
However, research and collaborations were in reality highly segmented, with a clear tendency toward specialization on a geographical and/or thematic basis (typically mortality, fertility, marriage and family formation or dissolution, migration, structures, projections). Demography, and to a lesser extent historical demography, hesitated between the temptation of autonomy, often associated with a retreat into its quantitative core, and its dissolution within the social sciences, with development studies or the “population sciences” lying in between. A real intellectual crisis resulted from this hesitation, as well as from frustration with segmentation, and from a growing awareness that description, especially quantification with a pretension to objectivity, hid and spread ideological visions of what a “good” or “optimal” population could be (Véron, 1993). Among the many reactions, revisions and re-examinations, new approaches and new methods progressively emerged.

Something that may seem very strange in retrospect, but that perfectly illustrates the preceding assertions, is the discovery in the 1980s of an almost complete absence of dialogue between demography and family sociology. While the family is the place where most demographic behaviors take place and, to some extent, are decided, “few textbooks on population contain a chapter devoted to the demography of the family. Where such chapter does exist, it is generally shorter and more superficial than those that deal with fertility, mortality, nuptiality, and migration, or with the dynamics of age structure” (Höhn, 1992, p. 3). In 1982, the International Union for the Scientific Study of Population created an ad hoc committee to develop its study, but as late as 1992 the leaders of this group still saw family demography as “a recent and relatively underdeveloped branch of population studies” (Berquo and Xenos, 1992, p. 8).
Its development has been extraordinary in recent years and is part of a shift from macro to micro, from an emphasis on macro-economic changes as the essential determinant of demographic changes to a multi-causal, multivariate approach to behaviors, and from average results to the study of distributions. In a quantitative discipline, major evolutions necessarily involve taking up technical challenges. “The traditional demographic analysis of such events as births, marriages, divorces, deaths, and migration, has the advantage that numbers of these events can be related to individuals in the same age group and can, therefore, be measured more easily and included in models. The inclusion of other family members in such analyses causes difficulties because they will generally differ in age and sex, and complications are also introduced because they do not generally live together continuously” (Höhn, 1992, p. 3). Although several attempts have been made to construct a “household demography” (Van Imhoff et al., 1995), the life course paradigm clearly imposed itself. Offering both concepts and statistical methods in an explicitly interdisciplinary perspective, it deeply renewed the discipline, representing a shift toward micro analysis of individual data and causal research (Dykstra and van Wissen, 1999). A first substantial gain has been the study of multiple events, marriage and first birth, or moving and starting a new job for instance, a kind of investigation that also raises the issue of event sequencing and interactions, which is typically treated with event history analysis. If people have several careers that they must make compatible, their life transitions also reflect socio-economic constraints, cultural norms (about the “proper” age, sex or behavior), as well as compromises between several individual aspirations within or beyond the domestic unit. Through research in this huge area, family demography has certainly made tremendous progress during the last 20 years.
Our strong feeling is that the shift has been so sudden that, overall, the complexity of causalities is too often underestimated (see especially Courgeau and Lelièvre, 1993; Blossfeld and Rohwer, 2002; Bocquier, 1996; Alter, 1998), as are several technical traps. The problem is essentially that, when studying a population of individuals observed over time, each life is the product of complex and multiple interactions and is in fact unique, so that interpreting and generalizing from samples requires caution. In the next section, we recall the main event history regression models and discuss the question of heterogeneity. We cannot consider that building individual-level indicators of household, family and community contexts is enough to deal with the increasingly raised issue of “linked” or “interdependent” lives (Hagestad, 2003). We show the value of robust estimates and shared frailty in that perspective. In the same section, we also present Markovian models, which are particularly useful for studying transitions within a set of states (social status, for example) observed periodically. In the interdisciplinary perspective of the life course, we consider it important to go beyond the simple transitions typically studied in demography (from single to married, from a first to a possible second child, from life to death, and so on) and to investigate how, from a starting position, a destination is selected among several possible ones. As family dynamics and life courses become more and more open, such investigations are essential to characterize transitions as “normal” or “non-normal” without falling again into the trap of an ideological reading (see, for example, Oris and Poulain, 2003). Indeed, we consider more generally that between aggregate descriptions and causal analysis there is an obvious deficit of research on trajectories.
Regression models indicate the probability that a factor, measured by an indicator, affects a risk, but such results tell us nothing about timing, nor about the alternatives to this risk in life courses. It is essential to look carefully at transitions in trajectories to properly target a causal analysis, and this step is clearly too often superficial, if not absent. Several methods, recently developed or recently made available in statistical packages, offer opportunities to fill this gap. In Section 3, we introduce the data mining approach, especially the mining of sequential association rules between events and the use of induction trees.

2 Statistical modeling of life events

Life course data are longitudinal in essence. Here, we focus on events, an event being a change of state of some discrete variable, e.g. marital status, number of children, job, or place of residence. Such data are collected in mainly two ways: as a collection of time stamped events or as state sequences. In the former case, each individual is described by a collection of time stamped events, i.e. the occurrence of each event of interest (e.g. getting married, birth of a child, end of a job, moving) is recorded together with the time at which it occurred. In the latter case, the life events of each individual are represented by the sequence of states of the variables of interest. Panel data are a special case of state sequences in which the states are observed at periodic times. The first kind of data is typically analysed with event history regression methods, while methods for state sequence analysis such as Markov transition models are best suited to the second. We briefly discuss hereafter the scope and limits of these approaches.
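To make the distinction between the two data formats concrete, the following minimal Python sketch converts a collection of time stamped events into a state sequence observed at periodic times, as in panel data. The function name, the data layout and the example individual are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from bisect import bisect_right


def events_to_state_sequence(initial_state, events, observation_times):
    """Return the state occupied at each observation time.

    events: list of (time, new_state) pairs, one per time stamped event.
    The state at an observation time is the one set by the last event
    that occurred at or before that time (an assumption of this sketch).
    """
    ordered = sorted(events)
    event_times = [time for time, _ in ordered]
    sequence = []
    for obs in observation_times:
        k = bisect_right(event_times, obs)  # events occurred up to obs
        sequence.append(initial_state if k == 0 else ordered[k - 1][1])
    return sequence


# Hypothetical individual: marital status changes at ages 24 and 31,
# observed every five years from age 20 to age 35.
marital_events = [(24, "married"), (31, "divorced")]
print(events_to_state_sequence("single", marital_events, [20, 25, 30, 35]))
# prints ['single', 'married', 'married', 'divorced']
```

The conversion loses information: the exact event times (24 and 31) cannot be recovered from the periodic sequence, which is one reason why the two formats call for different methods.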
2.1 Event history regression models

When we have time stamped events, the question of interest is the duration of the spell between two successive events or, somewhat equivalently, the hazard rate $h(t)$ for the next event to occur precisely after a duration $t$, i.e. the conditional probability for the event to occur at $t$ knowing that it did not occur before $t$. Longitudinal regression models focus on this aspect. They express either the duration or the hazard rate as a function of covariates. There are continuous time models and discrete time forms. With continuous time, the main formulations (see Blossfeld et al., 1989; Courgeau and Lelièvre, 1993) are as a duration model,

$$T(x_1, \dots, x_p) = T_0 \exp(\beta_1 x_1 + \cdots + \beta_p x_p),$$

or as a proportional hazard model,

$$h(t, x_1, \dots, x_p) = h_0(t) \exp(\beta_1 x_1 + \cdots + \beta_p x_p).$$

The former, also known as the accelerated failure time model, usually assumes an exponential, Weibull, log-normal, log-logistic or gamma distribution for $T$. The proportional hazard model is compatible with, for instance, exponential, Weibull and Gompertz duration distributions. It also includes the perhaps most widely used Cox (1972) semi-parametric model, which requires no assumption on the form of the duration distribution. Most statistical packages (SAS, S-Plus, Stata, R, TDA, ...) provide procedures for estimating such models. SPSS, however, supports only the Cox model. Discrete time models (see Allison, 1982; Yamaguchi, 1991) include the proportional hazard odds ratio model, also due to Cox (1972),

$$\frac{h_t(x_1, \dots, x_p)}{1 - h_t(x_1, \dots, x_p)} = \frac{h_t(0)}{1 - h_t(0)} \exp(\beta_1 x_1 + \cdots + \beta_p x_p),$$

and the log-rate model (Holford, 1980),

$$h_t(x_1, \dots, x_p) = \exp(\beta_1 x_1 + \cdots + \beta_p x_p).$$

In the latter case, the $x_i$'s are usually dummies coding categorical variables, their interactions and, possibly, interactions with $t$.
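As a numerical illustration of the proportional hazard and proportional hazard odds ratio formulations above, the following Python sketch evaluates both for a single binary covariate. The baseline hazard values and the coefficient are arbitrary assumptions chosen for the example, not estimates from any data set, and the function names are hypothetical.

```python
import math


def linear_predictor(betas, x):
    """beta_1 x_1 + ... + beta_p x_p"""
    return sum(b * xi for b, xi in zip(betas, x))


def proportional_hazard(h0_t, betas, x):
    """Continuous time: h(t, x) = h0(t) * exp(beta . x)."""
    return h0_t * math.exp(linear_predictor(betas, x))


def discrete_time_hazard(ht_0, betas, x):
    """Discrete time: solve the proportional hazard odds ratio model for h_t(x)."""
    odds = ht_0 / (1.0 - ht_0) * math.exp(linear_predictor(betas, x))
    return odds / (1.0 + odds)


betas = [0.5]  # assumed coefficient of one binary covariate
for h0_t in (0.02, 0.10):  # assumed baseline hazard at two durations
    ratio = proportional_hazard(h0_t, betas, [1]) / proportional_hazard(h0_t, betas, [0])
    print(f"h0(t)={h0_t:.2f}  hazard ratio={ratio:.4f}")

h1 = discrete_time_hazard(0.10, betas, [1])
odds_ratio = (h1 / (1 - h1)) / (0.10 / 0.90)
print(f"discrete time hazard={h1:.4f}  odds ratio={odds_ratio:.4f}")
```

In both cases the effect of the covariate is the same multiplicative factor $\exp(0.5)$ whatever the baseline value: it multiplies the hazard in the continuous time model and the odds of the hazard in the discrete time model, which is what "proportional" means here.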
For the estimation of the proportional hazard odds ratio model, some assumptions are usually required on the baseline hazard odds. Letting $\beta_{0t}$ be the baseline log-odds of the hazard, $\ln[h_t(0)/(1 - h_t(0))]$, the most common assumptions are $\beta_{0t} = \beta_0$ (constant), $\beta_{0t} = \beta_0 t$ (linear in $t$, Gompertz), and $\beta_{0t} = \beta_0 \ln t$ (linear in $\ln t$, Weibull). With these assumptions, a proportional hazard odds ratio model can, if we organize the data in person-period form, simply be estimated as a logistic regression. Hence, it can be estimated by any software that offers logistic regression. Likewise, a log-rate model can be estimated with any log-linear model procedure that allows for weighted cell frequencies. Indeed, the log-rate model is a log-linear model of the weighted number of events occurring in a time interval, the weight being the inverse of the population at risk in this interval.

A common issue with time to event models is the handling of censored data. Censored data occur when the observed start (left) and/or end (right) time of a spell is not its actual start or end time. For instance, if we observe job duration, some jobs may not be terminated at the time of the survey and are hence right censored. Though no event is recorded at the end of right censored spells, these cases are taken into account by entering the population at risk for job lengths lower than or equal to the observed duration. Another issue is the handling of time varying covariates. The solution is quite straightforward in the discrete time setting, which works on person-time data. For the continuous case, there are two major solutions: an ad hoc extension of the Cox model that allows for discrete time varying covariates, and the episode-splitting approach (see for instance Blossfeld and Rohwer, 2002, for details). For more advanced developments of the Cox model see Therneau and Grambsch (2000).
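The person-period organization and the treatment of right censored spells described above can be sketched as follows in Python. The spell layout (keys `id`, `duration`, `event`, `x`) and the example data are hypothetical assumptions of this sketch; the resulting rows are what one would pass to a logistic regression routine, with `y` as the response and `t` and `x` as predictors.

```python
def to_person_period(spells):
    """Expand each spell into one row per period at risk.

    Each spell is a dict with an identifier 'id', an integer 'duration'
    (number of periods observed), a flag 'event' (True if the spell ended
    with the event, False if it is right censored) and a covariate 'x'.
    """
    rows = []
    for spell in spells:
        for t in range(1, spell["duration"] + 1):
            is_last = t == spell["duration"]
            rows.append({
                "id": spell["id"],
                "t": t,
                "x": spell["x"],  # a time varying covariate would change per row
                "y": int(spell["event"] and is_last),
            })
    return rows


def crude_hazard(rows):
    """Events divided by population at risk, for each period t."""
    at_risk, events = {}, {}
    for row in rows:
        at_risk[row["t"]] = at_risk.get(row["t"], 0) + 1
        events[row["t"]] = events.get(row["t"], 0) + row["y"]
    return {t: events[t] / at_risk[t] for t in sorted(at_risk)}


# Hypothetical job spells: the first ends after 3 periods, the second is
# still ongoing (right censored) after 2 periods.
spells = [
    {"id": 1, "duration": 3, "event": True, "x": 0},
    {"id": 2, "duration": 2, "event": False, "x": 1},
]
rows = to_person_period(spells)
print([row["y"] for row in rows])  # prints [0, 0, 1, 0, 0]
print(crude_hazard(rows))          # prints {1: 0.0, 2: 0.0, 3: 1.0}
```

The censored spell contributes to the population at risk in periods 1 and 2 but never records an event, which is how right censoring enters the estimation without being treated as a termination.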
This event history modeling, especially the Cox proportional hazard and the Cox discrete time proportional hazard odds ratio models, has become popular among demographers. Together with other social scientists, historical demographers have to face issues such as competing events (multiple destinations), repeatable events and interacting events. The first two can easily be handled with software such as TDA (Rohwer and Pötter, 2002), which supports episodes defined by four parameters, namely the origin state, the start time, the destination state and the end time. The interaction between events, marriage and first child for instance, requires a simultaneous equation approach, which has been investigated for instance by Lillard (1993).

Shared heterogeneity and multi-level modeling. A further issue of importance, shared heterogeneity, has to do with the sampling nature of the data. These are often clustered, i.e. the individual data come from a selection of groups, parishes or families for example. In such cases, members of the same group share the same contextual framework, and it is then of primary importance to distinguish effects that hold at the group level from those that work at the individual level.
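Returning to the four-parameter episode format mentioned earlier in this subsection (origin state, start time, destination state, end time), the following Python sketch shows how such episodes accommodate competing destinations and repeatable events. It is only an illustration of the data structure, not TDA code; the `Episode` class, the use of `None` to mark a right censored episode, and the example data are assumptions of this sketch.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Episode:
    origin: str
    start: float
    destination: Optional[str]  # None marks right censoring (sketch convention)
    end: float


def transition_rates(episodes):
    """Crude occurrence/exposure rate for each (origin, destination) pair."""
    exposure = defaultdict(float)  # total time spent at risk in each origin state
    counts = defaultdict(int)      # observed transitions per destination
    for ep in episodes:
        exposure[ep.origin] += ep.end - ep.start
        if ep.destination is not None:
            counts[(ep.origin, ep.destination)] += 1
    return {pair: n / exposure[pair[0]] for pair, n in counts.items()}


# Hypothetical episodes: two competing exits from "single", one censored
# episode, and a later episode starting from "married".
episodes = [
    Episode("single", 0, "married", 5),
    Episode("single", 0, "cohabiting", 3),
    Episode("single", 0, None, 2),
    Episode("married", 5, "divorced", 15),
]
for pair, rate in sorted(transition_rates(episodes).items()):
    print(pair, rate)
```

Because every episode carries its own origin and destination, competing risks are simply different destinations from the same origin, and a repeatable event appears as several episodes for the same individual.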